
13 research outputs found

A bioinformatics knowledge discovery in text application for grid computing

Author: A Hotho
AM Cohen
D Talia
EG Talbi
Gianfranco Tarricone
Giuseppe Mastronardi
H Shatkay
I Foster
IH Witten
M Castellano
Marcello Castellano
P Zweigenbaum
PC Carvalho
R Mooney
RC Bunescu
Roberto Bellotti
U Leser
UM Fayyad
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract
Background: A fundamental activity in biomedical research is knowledge discovery, that is, searching through large amounts of biomedical information such as documents and data. High-performance computational infrastructures, such as Grid technologies, are emerging as a possible way to handle the intensive use of information and communication technology (ICT) resources in life science. The goal of this work was to develop a middleware solution that allows the many knowledge discovery applications to run on scalable, distributed computing systems and to make intensive use of ICT resources.
Methods: The development of a grid application for Knowledge Discovery in Text (KDT) using a middleware-based methodology is presented. The system must be able to perform a user application model and process its jobs so as to create many parallel jobs for distribution across the computational nodes. The system must also be aware of the available computational resources and their status, and must be able to monitor the execution of the parallel jobs. These operational requirements led to the design of a middleware that is specialized through user application modules. It includes a graphical user interface for accessing a node search system, a load balancing system and a transfer optimizer that reduces communication costs.
Results: A middleware prototype is presented, together with its performance evaluation in terms of the speed-up factor. It was written in Java on Globus Toolkit 4, which was used to build the grid infrastructure on GNU/Linux grid nodes. A test was carried out, and results are shown for the named entity recognition search of symptoms and pathologies. The search was applied to a collection of 5,000 scientific documents taken from PubMed.
Conclusion: In this paper we discuss the development of a grid application based on a middleware solution. It has been tested on a knowledge discovery in text process that extracts new and useful information about symptoms and pathologies from a large collection of unstructured scientific documents. As an example, a Knowledge Discovery in Databases computation was applied to the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.
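
The abstract reports performance as a speed-up factor obtained by distributing parallel jobs over grid nodes, but it does not give the formula or the partitioning scheme. A minimal sketch follows, assuming the conventional definition (serial run time divided by parallel run time) and a simple round-robin split of the document collection. The function names and the round-robin choice are illustrative assumptions, not the authors' middleware.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_jobs(documents: Sequence[T], n_jobs: int) -> List[List[T]]:
    """Round-robin split so that job sizes differ by at most one document.

    Assumption: the paper does not state how documents are partitioned.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    return [list(documents[i::n_jobs]) for i in range(n_jobs)]


def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Conventional speed-up factor: serial run time over parallel run time."""
    if serial_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("run times must be positive")
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds: float, parallel_seconds: float, n_nodes: int) -> float:
    """Speed-up divided by the number of nodes (1.0 would be ideal scaling)."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return speedup(serial_seconds, parallel_seconds) / n_nodes
```

With a collection of 5,000 documents and 8 hypothetical nodes, `split_into_jobs` would produce 8 jobs of 625 documents each. The run times passed to `speedup` would have to be measured; none are given in the abstract.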

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Archivio istituzionale della ricerca - Università di Bari

Linguistic feature analysis for protein interaction extraction

Author: A Airola
A Moschitti
A Yakushiji
B Schölkopf
C Cortes
C Giuliano
C Nedellec
CC Chang
Chris Cornelis
D Haussler
H Lodhi
J Ding
J Xiao
JH Eom
K Fundel
M Collins
Martine De Cock
MF Porter
R Bunescu
R Saetre
RC Bunescu
S Katrenko
S Kim
S Pyysalo
S Van Landeghem
T Fayruzov
Timur Fayruzov
Veronique Hoste
Y Saeys
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract
Background: The rapid growth in the number of publicly available reports on biomedical experimental results has recently led to a surge of text mining approaches for protein interaction extraction. Most approaches rely implicitly or explicitly on linguistic (i.e., lexical and syntactic) data extracted from text. However, only a few attempts have been made to evaluate the contribution of the different feature types. In this work, we contribute to this evaluation by studying the relative importance of deep syntactic features (grammatical relations), shallow syntactic features (part-of-speech information) and lexical features. For this purpose, we use a recently proposed approach based on support vector machines with structured kernels.
Results: Our results reveal that the contribution of the different feature types varies across the data sets on which the experiments were conducted. The smaller the training corpus compared to the test data, the more important the role of grammatical relations becomes. Moreover, classifiers based on deep syntactic information prove to be more robust on heterogeneous texts where little or no common vocabulary is shared.
Conclusion: Our findings suggest that grammatical relations play an important role in the interaction extraction task. Moreover, the net advantage of adding lexical and shallow syntactic features is small relative to the number of added features. This implies that efficient classifiers can be built using only a small fraction of the features typically used in recent approaches.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Ghent University Academic Bibliography

PubMed Central

Mining clinical relationships from patient narratives

Author: A Rector
A Roberts
Angus Roberts
C Blaschke
C Friedman
C Giuliano
C Grover
C Nédellec
CB Ahlers
D Klein
D Lindberg
D Zelenko
Defense Advanced Research Projects Agency
G Doddington
G Zhou
H Cunningham
H Harkema
J Pustejovsky
J Thomas
K Fundel
M Goadrich
Mark Hepple
N Chinchor
N Sager
P Zweigenbaum
R Bunescu
R Gaizauskas
RC Bunescu
Robert Gaizauskas
S Katrenko
S Miller
S Pakhomov
T Rindflesch
T Wang
TC Rindflesch
U Hahn
W Chapman
Y Li
Y Lussier
Yikun Guo
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract
Background: The Clinical E-Science Framework (CLEF) project has built a system to extract clinically significant information from the textual component of medical records in order to support clinical research, evidence-based healthcare and genotype-meets-phenotype informatics. One part of this system is the identification of relationships between clinically important entities in the text. Typical approaches to relationship extraction in this domain have used full parses, domain-specific grammars, and large knowledge bases encoding domain knowledge. In other areas of biomedical NLP, statistical machine learning (ML) approaches are now routinely applied to relationship extraction. We report on the novel application of these statistical techniques to the extraction of clinical relationships.
Results: We have designed and implemented an ML-based system for relation extraction using support vector machines, and trained and tested it on a corpus of oncology narratives hand-annotated with clinically important relationships. Over a class of seven relation types, the system achieves an average F1 score of 72%, only slightly behind an indicative measure of human inter-annotator agreement on the same task. We investigate the effectiveness of different features for this task, examine how extraction performance varies between inter- and intra-sentential relationships, and assess the amount of training data needed to learn various relationships.
Conclusion: We have shown that it is possible to extract important clinical relationships from text, using supervised statistical ML techniques, at levels of accuracy approaching those of human annotators. Given the importance of relation extraction as an enabling technology for text mining, and given the ready adaptability of systems based on our supervised learning approach to other clinical relationship extraction tasks, this result has significance for clinical text mining more generally. Further work to confirm these encouraging results should be carried out on a larger sample of narratives and relationship types.
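
The abstract reports an average F1 score over seven relation types without stating how the average is taken. As a minimal sketch, assuming a macro average (the unweighted mean of per-type F1 scores), the figure could be computed from per-type counts as follows. The relation names and counts in the usage note are hypothetical placeholders, not data from the paper.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives.

    Uses the equivalent closed form 2*TP / (2*TP + FP + FN).
    """
    denominator = 2 * tp + fp + fn
    return 0.0 if denominator == 0 else 2 * tp / denominator


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-relation F1 scores.

    Assumption: the paper's "average F1" may be computed differently
    (for example, micro-averaged over all instances).
    """
    if not counts:
        raise ValueError("at least one relation type is required")
    return sum(f1_score(*c) for c in counts.values()) / len(counts)
```

For example, `macro_f1({"relation_a": (8, 2, 2), "relation_b": (3, 1, 1)})` averages the F1 scores of two hypothetical relation types, each given as a `(tp, fp, fn)` tuple.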

Crossref

Springer - Publisher Connector

PubMed Central

White Rose Research Online

HypertenGene: extracting key hypertension genes from biomedical literature with position and automatically-generated template features

Author: A Rzhetsky
AK Ramani
B Rosario
C Blaschke
Chi-Hsin Huang
F Sha
H-W Chun
Hong-Jie Dai
HW Chun
J Lafferty
J Nocedal
J Xiao
JN Darroch
K Becker
K Hirohata
M Bundschus
M Craven
M Masseroli
M Shimbo
N Kambhatla
P Ruch
Po-Ting Lai
R Bunescu
R Weissberg
RC Bunescu
Richard Tzong-Han Tsai
RT Tsai
RTK Lin
T Ono
T Rindflesch
TC Rindflesch
TF Smith
TH Tsai
Wen-Harn Pan
Wen-Lian Hsu
Y Yamamoto
Yen-Ching Chang
Yue-Yang Bow
Z GuoDong
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract
Background: The genetic factors leading to hypertension have been extensively studied, and large numbers of research papers have been published on the subject. One of hypertension researchers' primary tasks is to locate key hypertension-related genes in abstracts. However, gathering such information with existing tools is not easy: (1) searching for articles often returns far too many hits to browse through; (2) the search results do not highlight the hypertension-related genes discovered in the abstract; (3) even though some text mining services mark up gene names in the abstract, the key genes investigated in a paper are still not distinguished from other genes. To facilitate the information gathering process for hypertension researchers, one solution would be to extract the key hypertension-related genes in each abstract. Three major tasks are involved in the construction of this system: (1) gene and hypertension named entity recognition, (2) section categorization, and (3) gene-hypertension relation extraction.
Results: We first compare the retrieval performance achieved by individually adding template features and position features to the baseline system. Then, the combination of both is examined. We found that using position features can almost double the original AUC score of the baseline system (0.8140 vs. 0.4936). However, adding template features alone results in only a marginal improvement (0.0197). Including both improves AUC to 0.8184, indicating that these two sets of features are complementary and do not have overlapping effects. We then examine the performance in a different domain, diabetes, and the result shows a satisfactory AUC of 0.83.
Conclusion: Our approach successfully exploits template features to recognize true hypertension-related gene mentions and position features to distinguish key genes from other related genes. Templates are automatically generated and checked by biologists to minimize labor costs. Our approach integrates the advantages of machine learning models and pattern matching. To the best of our knowledge, this is the first systematic study of extracting hypertension-related genes and the first attempt to create a hypertension-gene relation corpus based on the GAD database. Furthermore, our paper proposes and tests novel features for extracting key hypertension genes, such as relative position, section, and template features, which could also be applied to key-gene extraction for other diseases.
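
The results above are compared by AUC, but the abstract does not say how it is computed. The sketch below assumes the area under the ROC curve, using its pairwise-ranking interpretation: the probability that a randomly chosen positive item is scored above a randomly chosen negative one, with ties counted as one half. The function name and the sample data in the usage note are illustrative, not taken from the paper.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: 1 for a positive item (e.g. a key gene), 0 for a negative one.
    Quadratic in the number of items, which is fine for a small illustration.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative item")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

For the toy input `roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])`, three of the four positive-negative pairs are ranked correctly, so the function returns 0.75.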

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central